Abstract

We study distributed stochastic convex optimization under the delayed gradient model where the 
server nodes perform parameter updates, while the worker nodes compute stochastic gradients. We 
discuss, analyze, and experiment with a setup motivated by the behavior of real-world distributed 
computation networks, where machines are slow to different degrees at different times. Therefore, we allow
the parameter updates to be sensitive to the actual delays experienced, rather than to worst-case
bounds on the maximum delay. This sensitivity leads to larger stepsizes, which can help gain rapid
initial convergence without having to wait too long for slower machines, while maintaining the same 
asymptotic complexity. We obtain encouraging improvements to overall convergence for distributed 
experiments on real datasets with up to billions of examples and features. 

1 Introduction

We study the stochastic convex optimization problem 

$$\min_{x \in \mathcal{X}} f(x) := \mathbb{E}_{\xi}[F(x; \xi)], \tag{1.1}$$

where the constraint set $\mathcal{X} \subset \mathbb{R}^d$ is a compact convex set, and $F(\cdot; \xi)$ is a convex loss for each
$\xi \sim P$, where $P$ is an unknown probability distribution from which we can draw i.i.d. samples.
Problem (1.1) is broadly important in both optimization and machine learning [6, 12, 16-18]. It should
be distinguished from (and is harder than) the finite-sum optimization problems [3, 15], for which
sharper results on the empirical loss are possible but not on the generalization error.

A classic approach to solve (1.1) is via stochastic gradient descent (SGD) [14] (also called stochastic 
approximation [12]). At each iteration SGD performs the update $x(t+1) \leftarrow \Pi_{\mathcal{X}}(x(t) - \alpha_t g(x(t)))$,
where $\Pi_{\mathcal{X}}$ denotes orthogonal projection onto $\mathcal{X}$, scalar $\alpha_t > 0$ is a suitable stepsize, and $g(x(t))$ is an
unbiased stochastic gradient such that $\mathbb{E}[g(x(t))] \in \partial f(x(t))$.
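
To make the update concrete, here is a minimal Python sketch of projected SGD. It is an illustration rather than the authors' implementation: the Euclidean-ball constraint set, the noiseless least-squares loss, the stepsize $\alpha_t = 1/\sqrt{t}$, and all function names are assumptions chosen for the example.

```python
import math
import random

def project_ball(x, radius):
    """Orthogonal projection of x onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def sgd(sample, grad, x0, radius, steps):
    """Projected SGD: x(t+1) = Pi_X(x(t) - alpha_t * g(x(t)))."""
    x = list(x0)
    for t in range(1, steps + 1):
        xi = sample()                      # draw one i.i.d. sample
        g = grad(x, xi)                    # unbiased stochastic gradient
        alpha = 1.0 / math.sqrt(t)         # assumed stepsize schedule
        x = project_ball([xv - alpha * gv for xv, gv in zip(x, g)], radius)
    return x

# Toy problem (assumption): F(x; (a, y)) = 0.5 * (<a, x> - y)^2 with y = <a, w_true>.
rng = random.Random(0)
w_true = [0.5, -0.3]

def sample():
    a = [rng.uniform(-1, 1) for _ in w_true]
    return a, sum(ai * wi for ai, wi in zip(a, w_true))

def grad(x, xi):
    a, y = xi
    residual = sum(ai * xv for ai, xv in zip(a, x)) - y
    return [residual * ai for ai in a]

x_hat = sgd(sample, grad, [0.0, 0.0], radius=1.0, steps=5000)
```

Because the toy labels are noiseless and `w_true` lies inside the ball, the iterate `x_hat` should end up very close to `w_true`.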

Although much more scalable than gradient descent, SGD is still a sequential method that cannot 
be immediately used for truly large-scale problems requiring distributed optimization. Indeed, 
distributed optimization [2] is a central focus of real-world machine learning, and has attracted
significant recent research interest, a large part of which is dedicated to scaling up SGD [1, 5, 7, 9, 13].

Motivation. Our work is motivated by the need to more precisely model and exploit the delay 
properties of real-world cloud providers; indeed, the behavior of machines and delays in such settings
are typically quite different from what one may observe on small clusters owned by individuals or
small groups. In particular, cloud resources are shared by many users who run variegated tasks on
them. Consequently, such an environment will invariably be more diverse in terms of availability
of key resources such as CPU, disk, or network bandwidth, as compared to an environment where
resources are shared by a small number of individuals. Thus, being able to accommodate variable
delays is of great value to both providers and users of large-scale cloud services.

In light of this background, we investigate delay-sensitive asynchronous SGD, in particular for its
ability to adapt to the actual delays experienced rather than relying on global worst-case 'bounded delay'
arguments that can be too pessimistic. A potential practical approach is as follows: in the beginning
the server updates parameters whenever it receives a gradient from any machine, with a weight
inversely proportional to the actual delay observed. Towards the end, the server may take larger
update steps whenever it gets a gradient from a machine that sends updates infrequently, and
smaller ones if it gets a gradient from a machine that updates very frequently, to reduce the bias caused
by the initial aggressive steps.

Contributions. The key contributions of this paper are underscored by our practical motivation. 
In particular, we design, analyze and investigate AdaDelay (Adaptive Delay), an asynchronous SGD 
algorithm, that more closely follows the actual delays experienced during computation. Therefore, 
instead of penalizing parameter updates by using worst-case bounds on delays, AdaDelay uses step 
sizes that depend on the actual delays observed. While this allows the use of larger stepsizes, it 
requires a slightly more intricate analysis because (i) step sizes are no longer guaranteed to be
monotonically decreasing; and (ii) residuals that measure progress are not independent across time as 
they are coupled by the delay random variable. 

We validate our theoretical framework by experimenting with large-scale machine learning datasets 
containing over a billion points and features. The experiments reveal that our assumptions of network 
delay are a reasonable approximation to the actual observed delays, and that in the regime of large 
delays (e.g., when there are stragglers), using delay sensitive steps is very helpful toward obtaining 
models that more quickly converge on test accuracy; this is revealed by experiments where using 
AdaDelay leads to significant improvements on the test error (AUC).

Related Work. A useful summary of aspects of stochastic optimization in machine learning is [18];
more broadly, [12,17] are excellent references. Our focus is on distributed stochastic optimization in 
the asynchronous setting. The classic work [2] is an important reference; more recent works closest 
to ours are [1, 7, 11, 16]. Of particular relevance to our paper is the recent work on delay adaptive 
gradient scaling in an AdaGrad like framework [10]. The work [10] claims substantial improvements 
under specialized settings over [4], a work that exploits data sparsity in a distributed asynchronous
setting. Our experiments confirm [10]'s claims that their best learning rate is insensitive to maximum
delays. However, in our experience the method of [10] overly smooths the optimization path, which 
can have adverse effects on real-world data (see Section 4). 

To our knowledge, all previous works on asynchronous SGD (and its AdaGrad variants) assume 
monotonically diminishing step-sizes. Our analysis, although simple, shows that rather than using 
worst case delay bounds, using exact delays to control step sizes can be remarkably beneficial in 
realistic settings: for instance, when there are stragglers that can slow down progress for all the 
machines in a worst-case delay model. 

Algorithmically, the work [1] is the one most related to ours; the authors of [1] consider using delay 
information to adjust the step size. However, the most important difference is that they only use the 
worst possible delays which might be too conservative, while AdaDelay leverages the actual delays 
experienced. The work [7] investigates two variants of update schemes, both of which occur with delay, but
they do not exploit the actual delays either. There are some other interesting works studying specific
scenarios, for example [4], which focuses on sparse data. However, our framework is more general
and thus capable of covering more applications.

2 Problem Setup and Algorithm

We build on the groundwork laid by [1, 11]; like them, we also consider optimizing (1.1) under a
delayed gradient model. The computational framework that we use is the parameter server [9], so that
a central server^1 maintains the global parameter, and the worker nodes compute stochastic gradients
using their share of the data. The workers communicate their gradients back to the central server
(in practice using various crucial communication-saving techniques implemented in the work of [8]),
which updates the shared parameter and communicates it back.

To highlight our key ideas and avoid getting bogged down in unilluminating details, we consider
only smooth stochastic optimization, i.e., $f \in C_L^1$, in this paper. Straightforward, albeit laborious,
extensions are possible to nonsmooth problems, strongly convex costs, mirror descent versions, and
proximal splitting versions. Such details are relegated to a longer version of this paper.

Specifically, following [1, 6, 12] we also make the following standard assumptions:^2

Assumption 2.1 (Lipschitz gradients). The function $f$ has locally $L$-Lipschitz gradients. That is,

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \quad \forall\, x, y \in \mathcal{X}.$$

Assumption 2.2 (Bounded variance). There exists a constant $\sigma < \infty$ such that

$$\mathbb{E}_{\xi}\left[\|\nabla f(x) - \nabla F(x; \xi)\|^2\right] \le \sigma^2, \quad \forall\, x \in \mathcal{X}.$$

Assumption 2.3 (Compact domain). Let $x^* \in \operatorname{argmin}_{x \in \mathcal{X}} f(x)$. Then,

$$\max_{x \in \mathcal{X}} \|x - x^*\| \le R.$$

Finally, an additional assumption, also made in [1], is that of bounded gradients.

Assumption 2.4 (Bounded gradient). For all $x \in \mathcal{X}$,

$$\|\nabla f(x)\| \le G.$$

These assumptions are typically reasonable for machine learning problems, for instance, logistic-regression
losses and least-squares costs, as long as the data samples $\xi$ remain bounded, which is
typically easy to satisfy. Exploring relaxed versions of these assumptions would also be interesting.
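
As a small numerical illustration of why bounded data helps, the sketch below checks that the gradient of a single logistic loss $F(x; (a, y)) = \log(1 + \exp(-y \langle a, x \rangle))$ never has norm larger than $\|a\|$, so bounded samples give bounded stochastic gradients. The loss parameterization with labels in $\{-1, +1\}$ and all names are assumptions made for this example.

```python
import math
import random

def logistic_grad(x, a, y):
    """Gradient in x of F(x; (a, y)) = log(1 + exp(-y * <a, x>)), with y in {-1, +1}."""
    margin = y * sum(ai * xi for ai, xi in zip(a, x))
    weight = -y / (1.0 + math.exp(margin))  # magnitude is at most 1
    return [weight * ai for ai in a]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

rng = random.Random(1)
worst_ratio = 0.0
for _ in range(1000):
    a = [rng.uniform(-1, 1) for _ in range(5)]   # bounded feature vector
    x = [rng.uniform(-3, 3) for _ in range(5)]   # arbitrary parameter
    y = rng.choice([-1, 1])
    worst_ratio = max(worst_ratio, norm(logistic_grad(x, a, y)) / norm(a))
```

Here `worst_ratio` stays at or below 1, consistent with taking $G$ as a bound on the feature norms.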

Notation: We denote a random delay at time-point $t$ by $\tau_t$; step sizes are denoted $\alpha(t, \tau_t)$, and delayed
gradients $g(t - \tau_t)$. For a differentiable convex function $h$, the corresponding Bregman divergence is
$D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle$. For simplicity, all norms are assumed to be Euclidean. We
also interchangeably use $x_t$ and $x(t)$ to refer to the same quantity.

2.1 Delay model 

Assumption 2.5 (Delay). We consider the following two practical delay models:

(A) Uniform: Here $\tau_t \sim U(\{0, 1, \ldots, 2\bar{\tau}\})$. This model is a reasonable approximation to observed delays
after an initial startup time of the network. We could make a more refined assumption that for
iterations $1 \le t \le T_1$, the delays are uniform on $\{0, \ldots, t - 1\}$, and the analysis easily extends to
handle this case; we omit it for ease of presentation. Furthermore, the analysis also extends to
delays having distributions with bounded support. Therefore, it indeed captures a wide spectrum
of practical models.

(B) Scaled: For each $t$, there is a $\theta_t \in (0, 1)$ such that $\tau_t < \theta_t t$. Moreover, assume that

$$\mathbb{E}[\tau_t] = \bar{\tau}_t, \qquad \mathbb{E}[\tau_t^2] = B_t^2$$

are constants that do not grow with $t$ (the subscript only indicates that each $\tau_t$ is a random variable
that may have a different distribution). This model allows delay processes that are richer than
uniform as long as they have bounded first and second moments.

^1 This server is virtual; its physical realization may involve several machines, e.g., [8].

^2 These are easily satisfied for logistic regression and least squares, if the training data are bounded.
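
The two delay models can be simulated with a few lines of Python. This is only a sketch: the uniform sampler follows model (A) directly, while the scaled sampler is one hypothetical instance of model (B), namely an exponential draw rounded down and capped so that the delay stays strictly below $\theta t$. The function names and the choice of the exponential distribution are assumptions, not part of the paper.

```python
import math
import random

def uniform_delay(rng, tau_bar):
    """Model (A): tau_t uniform on {0, 1, ..., 2 * tau_bar}."""
    return rng.randint(0, 2 * tau_bar)

def scaled_delay(rng, t, theta, mean):
    """One possible instance of model (B): an exponential draw with the given
    mean, rounded down and capped so that tau_t < theta * t."""
    cap = math.ceil(theta * t) - 1  # largest integer strictly below theta * t
    return min(int(rng.expovariate(1.0 / mean)), cap)

rng = random.Random(0)
uniform_samples = [uniform_delay(rng, 5) for _ in range(20000)]
scaled_samples = [(t, scaled_delay(rng, t, 0.5, 3.0)) for t in range(1, 2001)]
uniform_mean = sum(uniform_samples) / len(uniform_samples)
```

With `tau_bar = 5` the empirical mean of the uniform delays is close to 5, and every scaled delay respects the cap $\tau_t < \theta t$ while keeping bounded first and second moments.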

Remark: Our analysis seems general enough to cover many other delay distributions by combining
our two delay models. For example, the Gaussian model (where $\tau_t$ obeys a Gaussian distribution
but its support must be truncated as $\tau_t > 0$) may be seen as a combination of the following: 1) When
$t > C$ (a suitable constant), the Gaussian assumption indicates $\tau_t < \theta_t t$, which falls under our second
delay model; 2) When $0 < t \le C$, our proof technique with bounded support (same as uniform model)
applies. Of course, we believe a more refined analysis for specific delays may help tighten constants.


2.2 Algorithm 

Under the above delay model, we consider the following projected stochastic gradient iteration:

$$x(t+1) \leftarrow \operatorname*{argmin}_{x \in \mathcal{X}} \left[ \langle g(t - \tau_t), x \rangle + \frac{1}{2\alpha(t, \tau_t)} \|x - x(t)\|^2 \right], \quad t = 1, 2, \ldots, \tag{2.1}$$

where the stepsize $\alpha(t, \tau_t)$ is sensitive to the actual delay observed. Iteration (2.1) generates a sequence
$\{x(t)\}$; the server also maintains the averaged iterate

$$\bar{x}(T) := \frac{1}{T} \sum_{t=1}^{T} x(t+1). \tag{2.2}$$

3 Analysis 


We use stepsizes of the form $\alpha(t, \tau_t) = (L + \eta(t, \tau_t))^{-1}$, where the step offsets $\eta(t, \tau_t)$ are chosen to be
sensitive to the actual delay of the incoming gradients. We typically use

$$\eta(t, \tau_t) = c \sqrt{t + \tau_t} \tag{3.1}$$

for some constant $c$ (to be chosen later). Actually, we can also consider time-varying multipliers $c_t$
in (3.1) (see Corollary 3.4), but initially for clarity of presentation we let $c$ be independent of $t$. Thus,
if there are no delays, then $\tau_t = 0$ and iteration (2.1) reduces to the usual synchronous SGD. The
constant $c$ is used to trade off contributions to the error bound from the noise variance $\sigma$, the radius
bound $R$, and potentially bounds on gradient norms.
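
The stepsize rule and one server update can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the constraint set is taken to be a Euclidean ball, the numeric values of $L$, $c$, and the delays are made up, and the function names are hypothetical.

```python
import math

def adadelay_stepsize(t, tau, L, c):
    """alpha(t, tau) = 1 / (L + eta(t, tau)) with eta(t, tau) = c * sqrt(t + tau)."""
    return 1.0 / (L + c * math.sqrt(t + tau))

def project_ball(x, radius):
    """Orthogonal projection onto the Euclidean ball (assumed constraint set)."""
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= radius else [v * radius / norm for v in x]

def adadelay_step(x, delayed_grad, t, tau, L, c, radius):
    """One server update: a projected gradient step with a delay-dependent stepsize."""
    alpha = adadelay_stepsize(t, tau, L, c)
    return project_ball([xi - alpha * gi for xi, gi in zip(x, delayed_grad)], radius)

# Hypothetical numbers: at t = 100, an observed delay of 3 versus a worst-case bound of 50.
alpha_actual = adadelay_stepsize(100, 3, L=1.0, c=1.0)
alpha_worst = adadelay_stepsize(100, 50, L=1.0, c=1.0)
x_next = adadelay_step([0.9, 0.0], [-1.0, 0.0], t=9, tau=7, L=1.0, c=1.0, radius=1.0)
```

Using the observed delay instead of a worst-case bound gives the larger stepsize (`alpha_actual > alpha_worst`), which is the effect the delay-sensitive rule is designed to exploit.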

Our convergence analysis builds heavily on [1]. But the key difference is that our step sizes $\alpha(t, \tau_t)$
depend on the actual delay $\tau_t$ experienced, rather than on a fixed worst-case bound on the maximum
possible network delay. These delay-dependent step sizes necessitate a slightly more intricate analysis.
The primary complexity arises from $\alpha(t, \tau_t)$ being no longer independent of the actual delay $\tau_t$. This in
turn raises another difficulty, namely that $\alpha(t, \tau_t)$ are no longer monotonically decreasing, as is typically
assumed in most convergence analyses of SGD. We highlight our theoretical result below; due to
space limitations, all the auxiliary lemmas are moved to the appendix.

Theorem 3.1. Let $x(t)$ be generated according to (2.1). Under Assumption 2.5 (A) (uniform delay) we have

$$\mathbb{E}\left[\sum_{t=1}^{T} \big(f(x(t+1)) - f(x^*)\big)\right] \le \left(\sqrt{2}\, c R^2 \bar{\tau} + \frac{\sigma^2}{c}\right) \sqrt{T} + \frac{L G^2 (4\bar{\tau} + 3)(\bar{\tau} + 1)}{6 c^2} \log T + \frac{1}{2}(L + c) R^2 + \bar{\tau} G R + \frac{L G^2 \bar{\tau} (\bar{\tau} + 1)(2\bar{\tau} + 1)^2}{6 (L^2 + c^2)},$$

while under Assumption 2.5 (B) (scaled delay) we have


E 




1 


“ c ^ ti 


+ GR 


T-l 


1+E 




S {T-ty 


2 ^ Bf + 1 + Ti 
“ + c2(l - 0t)t 


1r2(l + c). 


Proof Sketch. The proof begins by analyzing the difference $f(x(t+1)) - f(x^*)$; Lemma A.2 (provided
in the supplement) bounds this difference, ultimately leading to an inequality of the form:

$$\mathbb{E}\left[\sum_{t=1}^{T} \big(f(x(t+1)) - f(x^*)\big)\right] \le \mathbb{E}\left[\sum_{t=1}^{T} \big(\Delta(t) + \Gamma(t) + \Sigma(t)\big)\right].$$

The random variables $\Delta(t)$, $\Gamma(t)$, and $\Sigma(t)$ are in turn given by

$$\Delta(t) := \frac{1}{2\alpha(t, \tau_t)} \left[\|x^* - x(t)\|^2 - \|x^* - x(t+1)\|^2\right]; \tag{3.2}$$

$$\Gamma(t) := \langle \nabla f(x(t)) - \nabla f(x(t - \tau_t)),\; x(t+1) - x^* \rangle; \tag{3.3}$$

$$\Sigma(t) := \frac{1}{2\eta(t, \tau_t)} \|\nabla f(x(t - \tau_t)) - g(t - \tau_t)\|^2. \tag{3.4}$$

Once we bound these in expectation, we obtain the result claimed in the theorem. In particular,
Lemma A.3 bounds (3.2) under Assumption 2.5 (A), while Lemma A.4 provides a bound under
Assumption 2.5 (B). Similarly, Lemmas A.5 and A.6 bound (3.3), while Lemma A.7
bounds (3.4). Combining these bounds we obtain the theorem. □


Theorem 3.1 has several implications. Corollaries 3.2 and 3.3 indicate that both our delay models share
a similar convergence rate, while Corollary 3.4 shows that such results still hold even if we replace the
constant $c$ with a bounded (away from zero, and from above) sequence $\{c_t\}$ (a setting of great practical
importance). Finally, Corollary 3.5 mentions in passing a simple variant that considers $\eta_t = c_t (t + \tau_t)^{\beta}$
for $0 < \beta < 1$. It also highlights the known fact that for $\beta = 0.5$, the algorithm achieves the best
theoretical convergence.

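Corollary 3.2. Let $\tau_t$ satisfy Assumption 2.5 (A). Then we have

$$\mathbb{E}[f(\bar{x}(T)) - f^*] = O\left(D_1 \frac{1}{\sqrt{T}} + D_2 \frac{\log T}{T} + D_3 \frac{1}{T}\right),$$

where

$$D_1 = \sqrt{2}\, c R^2 \bar{\tau} + \frac{\sigma^2}{c}, \quad D_2 = \frac{L G^2 (4\bar{\tau} + 3)(\bar{\tau} + 1)}{6 c^2}, \quad D_3 = \frac{1}{2}(L + c) R^2 + \bar{\tau} G R + \frac{L G^2 \bar{\tau} (\bar{\tau} + 1)(2\bar{\tau} + 1)^2}{6 (L^2 + c^2)}.$$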

The following corollary follows easily from combining Theorem 3.1 with Lemma A.6. 
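Corollary 3.3. Let $\tau_t$ satisfy Assumption 2.5 (B); let $\bar{\tau}_t = \bar{\tau}$, $\theta_t = \theta$, and $B_t = B$ for all $t$. Then,

$$\mathbb{E}[f(\bar{x}(T)) - f^*] = O\left(D_4 \frac{1}{\sqrt{T}} + D_5 \frac{\log T}{T} + D_6 \frac{1}{T}\right),$$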


where 


D 4 = 


1 Cr2 

^cR^(t + l) + - 


/ D5 = 


G2(B2 + T + 1) 
C2(l -0) 


,De = 1{L + c)R^ + Gr(i 


n^B^ 

Corollary 3.4. If $\eta_t = c_t \sqrt{t + \tau_t}$ with $0 < M_1 \le c_t \le M_2$, then the conclusions of Theorem 3.1 and Corollaries 3.2
and 3.3 still hold, except that the term $c$ is replaced by $M_2$ and $1/c$ by $1/M_1$.

If we wish to use step size offsets $\eta_t = c_t (t + \tau_t)^{\beta}$ where $0 < \beta < 1$, we get a result of the form (we
report only the asymptotically worst term, as this result is of limited importance).

Corollary 3.5. Let $\eta_t = c_t (t + \tau_t)^{\beta}$ with $0 < M_1 \le c_t \le M_2$ and $0 < \beta < 1$. Then, there exists a constant
$D_7$ such that


4 Experiments 

We now evaluate the efficiency of AdaDelay in a distributed environment using real datasets. 

Setup. We collected two click-through rate datasets for evaluation, which are shown in Table 1. One
is the Criteo dataset^3, where the first 8 days are used for training while the following 2 days are used
for validation. We applied one-hot encoding for category and string features. The other dataset, named
CTR2, is collected from a large Internet company. We sampled 100 million examples from three weeks
for training, and 20 million examples from the next week for validation. We extracted 2 billion unique
features using the on-production feature extraction module. These two datasets have comparable size,
but different example-feature ratios. We adopt logistic regression as our classification model.



         training examples   test examples   unique features   non-zero entries
Criteo   1.5 billion         400 million     360 million       58 billion
CTR2     110 million         20 million      1.9 billion       13 billion

Table 1: Click-through rate datasets.

All experiments were carried out on a cluster with 20 machines. Most machines are equipped with
dual Intel Xeon 2.40GHz CPUs, 32 GB memory and 1 Gbit/s Ethernet. 

Algorithm. We compare AdaDelay with two related methods, AsyncAdaGrad [1] and AdaptiveRevision [10].
Their main difference lies in the choice of the learning rate at time $t$: $\alpha(t, \tau_t) = (L + \eta(t, \tau_t))^{-1}$.
Denote by $\eta_j(t, \tau_t)$ the $j$-th element of $\eta(t, \tau_t)$, and similarly by $g_j(t - \tau_t)$ the delayed gradient on feature $j$.

AsyncAdaGrad adopts a scaled learning rate $\eta_j(t, \tau_t) = \sqrt{\sum_{i=1}^{t} g_j^2(i, \tau_i)}$. AdaptiveRevision takes into
account actual delays by considering $g_j^{\mathrm{bak}}(t, \tau_t) = \sum_{i=t-\tau_t}^{t-1} g_j(i, \tau_i)$. It uses a non-decreasing learning
rate based on $\sqrt{\sum_{i=1}^{t} g_j^2(i, \tau_i) + 2 g_j(t, \tau_t)\, g_j^{\mathrm{bak}}(t, \tau_t)}$. Similar to AsyncAdaGrad and AdaptiveRevision,
we use a scaled learning rate in AdaDelay to better model the nonuniform sparsity of the dataset (this
step size choice falls within the purview of Corollary 3.4). In other words, we set $\eta_j(t, \tau_t) = c_j \sqrt{t + \tau_t}$,
where $c_j = \sqrt{\frac{1}{t} \sum_{i=1}^{t} \frac{i}{i + \tau_i}\, g_j^2(i - \tau_i)}$ averages the weighted delayed gradients on feature $j$. We follow
the common practice of fixing $L$ to 1 while choosing the best $\alpha(t, \tau_t) = \alpha_0 (L + \eta(t, \tau_t))^{-1}$ by a grid
search over $\alpha_0$.

^3 http://labs.criteo.com/downloads/download-terabyte-click-logs/

Implementation. We implemented these three methods in the parameter server framework [9], which
is a high-performance asynchronous communication library supporting various data consistency
models. There are two groups of nodes in this framework: workers and servers. Worker nodes
run independently from each other. At each time, a worker first reads a minibatch of data from
a distributed filesystem, and then pulls the relevant recent working set of parameters, namely the 
weights of the features that appear in this minibatch, from the server nodes. It next computes the 
gradients and then pushes these gradients to the server nodes. 

The server nodes maintain the weights. For each feature, both AsyncAdaGrad and AdaDelay store
the weight and the accumulated gradient, which is used to compute the scaled learning rate.
AdaptiveRevision, in contrast, needs two more entries for each feature.

To compute the actual delay $\tau$ for AdaDelay, we let the server nodes record the time $t(w, i)$ when
worker $w$ is pulling the weight for minibatch $i$. Denote by $t'(w, i)$ the time when the server nodes are
updating the weight by using the gradients of this minibatch. Then the actual delay of this minibatch
can be obtained by $t'(w, i) - t(w, i)$.
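
The bookkeeping described above can be sketched as a small Python class. This is a hypothetical illustration, not the parameter server's API: it assumes that "time" is measured by the server's update counter, and the class and method names are invented for the example.

```python
class DelayTracker:
    """Server-side record of pull times, used to compute t'(w, i) - t(w, i)."""

    def __init__(self):
        self.clock = 0        # number of updates applied so far (assumed notion of time)
        self.pull_time = {}   # (worker, minibatch) -> clock value at pull

    def on_pull(self, worker, minibatch):
        """Called when a worker pulls weights for a minibatch."""
        self.pull_time[(worker, minibatch)] = self.clock

    def on_push(self, worker, minibatch):
        """Called when the gradients of a minibatch are applied; returns the delay."""
        delay = self.clock - self.pull_time.pop((worker, minibatch))
        self.clock += 1
        return delay

tracker = DelayTracker()
tracker.on_pull("w0", 0)
tracker.on_pull("w1", 0)
delay_w0 = tracker.on_push("w0", 0)   # no update happened since w0 pulled
delay_w1 = tracker.on_push("w1", 0)   # one update (from w0) happened since w1 pulled
```

In this toy run the first push sees delay 0 and the second sees delay 1, since one update was applied between the second worker's pull and push.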

AdaptiveRevision needs the cached delayed gradient components for each feature $j$ to calculate its learning rate.
If we send these components over the network by following [10], we increase the total network communication by
50%, which harms the system performance due to the limited network bandwidth. Instead, we store
them at the server node while processing this minibatch. This incurs no extra network overhead;
however, it increases the memory consumption of the server nodes.

The parameter server implements a node using an operating system process, which has its own
communication and computation threads. In order to run thousands of workers on our limited 
hardware, we may combine server workers into a single process to reduce the system overhead. 

Results. We first visualize the actual delays observed at server nodes. As noted from Figure 1, the delay
$\tau_t$ is around $\theta t$ at the early stage of the training, while the constant $\theta$ varies for different tasks. For
example, it is close to 0.2 when training the Criteo dataset with 1,600 workers, while it increases to
1 for the CTR2 dataset with 400 workers. After the delay hits the value $\mu$, which is often half of
the number of workers, it behaves as a Gaussian distribution with mean $\mu$, as shown in the
bottom of Figure 1.

Next, we present the comparison results of these three algorithms by varying the number of 
workers. We use the AUC on the validation dataset as the criterion^4; a 1% difference is often significant
for click-through rate estimation. We set the minibatch size to 10 ^ and 10 ^ for Criteo and CTR2, 
respectively, to reduce the communication frequency for better system performance^5. We search $\alpha_0$ in
the range [ 10 “^, 1 ] and report the best results for each algorithm in Figure 2. 

As can be seen, AdaptiveRevision only outperforms AsyncAdaGrad on the Criteo dataset with a
large number of workers. This differs from [10], probably because the datasets we used
are 1000 times larger than the ones reported by [10], and because we evaluated the algorithms in a distributed
environment rather than a simulated setting; the former requires a large minibatch size.
However, as reported in [10], we also observed that AdaptiveRevision's best learning rate is insensitive
to the number of workers.

On the other hand, AdaDelay improves on AsyncAdaGrad when a large number of workers (greater
than 400 ) is used, which means the delay adaptive learning rate takes effect when the delay can be 
large. 

To further investigate this phenomenon, we simulated an overloaded cluster where several stragglers
may produce large delays; we do this by slowing down half of the workers by a random factor
in { 1 , 4 } when computing gradients. The results are shown in Figure 3. As can be seen, AdaDelay
consistently outperforms AsyncAdaGrad, which shows that adaptive modeling of the actual delay is
better than using a constant worst-case delay when the variance of the delays is large.

^4 We observed similar results on using LogLoss.

^5 Probably due to the scale and the sparsity of the datasets, we observed no significant improvement on the test AUC when
decreasing the minibatch size even to 1.

(a) Criteo dataset with 1,600 workers
(b) CTR2 dataset with 400 workers

Figure 1: The observed delays at server nodes. Left column: the first 3,000 delays on one server node.
Right column: the histogram of all delays.

Figure 2: Test AUC as a function of maximal delays.

Finally we report the system performance. We first present the speedup from 1 machine to 16 
machines, where each machine runs 100 workers. We observed a near linear speedup of AdaDelay, 
which is shown in Figure 4. The main reason is the asynchronous updating, which removes the
dependencies between worker nodes. In addition, using multiple workers within a machine can fully
utilize the computational resources by hiding the overhead of reading data and communicating the
parameters.

(b) CTR2

Figure 3: Test AUC as a function of maximal delays with the existence of stragglers.

         AdaDelay   AsyncAdaGrad   AdaptiveRevision
Criteo   24 GB      24 GB          55 GB
CTR2     97 GB      97 GB          200 GB

Table 2: Total memory used by server nodes.

In the parameter server framework, worker nodes only need to cache one or a few data
minibatches. Most memory is used by the server nodes to store the model. We summarize the
server memory usage for the three algorithms compared in Table 2.

As expected, AdaDelay and AsyncAdaGrad have similar memory consumption because the
extra storage needed by AdaDelay to track and compute the incurred delays $\tau$ is tiny. However,
AdaptiveRevision doubles memory usage, because of the extra entries that it needs for each
feature, and because of the cached delayed gradient.

5 Conclusions 

In real distributed computing environments, there are multiple factors contributing to delay, such as the
CPU speed, disk I/O, and network throughput. With the inevitable and sometimes unpredictable
phenomenon of delay, we considered distributed convex optimization by developing and analyzing
AdaDelay, an asynchronous SGD method that tolerates stale gradients.

A key component of our work that differs from existing approaches is the use of (server-side)
updates sensitive to the actual delay observed in the network. This allows us to use larger stepsizes
initially, which can lead to more rapid initial convergence, and a stronger ability to adapt to the
environment. We discussed details of two different realistic delay models: (i) uniform (more generally,
bounded support) delays, and (ii) scaled delays with constant first and second moments but not
necessarily bounded support. Under both models, we obtain theoretically optimal convergence
rates.

Figure 4: The speedup of AdaDelay.

Adapting more closely to observed delays and incorporating server-side delay sensitive gradient 
aggregation that combines the benefits of the adaptive revision framework [10] with our delayed
gradient methods is an important future direction. Extension of our analysis to handle constrained 
convex optimization problems (without requiring a projection oracle onto the constraint set) is also an 
important part of future work. 
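
A Technical details of the convergence analysis

We collect below some basic tools and definitions from convex analysis.

Definition A.1 (Bregman divergence). Let $h : \mathcal{X} \to \mathbb{R}$ be a differentiable strictly convex function.
The Bregman divergence $D_h : \mathcal{X} \times \mathcal{X} \to [0, \infty]$ generated by $h$ is

$$D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle, \quad x, y \in \mathcal{X}. \tag{A.1}$$

- Fenchel conjugate:

$$f^*(y) = \sup_{x \in \mathcal{X}} \langle x, y \rangle - f(x) \tag{A.2}$$

- Prox operator:

$$\operatorname{prox}_f(x) = \operatorname*{argmin}_{y \in \mathcal{X}} f(y) + \frac{1}{2}\|x - y\|^2, \quad x \in \mathcal{X} \tag{A.3}$$

- Moreau decomposition:

$$x = \operatorname{prox}_f(x) + \operatorname{prox}_{f^*}(x), \quad \forall\, x \in \mathcal{X} \tag{A.4}$$

- Fenchel-Young inequality:

$$\langle x, y \rangle \le f(x) + f^*(y) \tag{A.5}$$

- Projection lemma:

$$\langle y - \Pi_{\mathcal{X}}(y),\; x - \Pi_{\mathcal{X}}(y) \rangle \le 0, \quad \forall\, x \in \mathcal{X}. \tag{A.6}$$

- Descent lemma:

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2. \tag{A.7}$$

- Four-point identity: Bregman divergences satisfy the following four-point identity:

$$\langle \nabla h(a) - \nabla h(b),\; c - d \rangle = D_h(d, a) - D_h(d, b) - D_h(c, a) + D_h(c, b). \tag{A.8}$$

A special case of (A.8) is the "three-point" identity

$$\langle \nabla h(a) - \nabla h(b),\; b - c \rangle = D_h(c, a) - D_h(c, b) - D_h(b, a). \tag{A.9}$$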
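
As a quick sanity check of the four-point identity (A.8) and the three-point identity (A.9), the following Python sketch evaluates both sides numerically for $h(x) = \frac{1}{2}\|x\|^2$, the choice used later in the analysis. The helper names and the random test vectors are assumptions made for the example.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sub(u, v):
    return [ui - vi for ui, vi in zip(u, v)]

def h(x):
    return 0.5 * dot(x, x)

def grad_h(x):
    return list(x)  # gradient of 0.5 * ||x||^2

def bregman(x, y):
    """D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>, as in (A.1)."""
    return h(x) - h(y) - dot(grad_h(y), sub(x, y))

rng = random.Random(0)
a, b, c, d = ([rng.uniform(-1, 1) for _ in range(4)] for _ in range(4))

# Four-point identity (A.8)
four_lhs = dot(sub(grad_h(a), grad_h(b)), sub(c, d))
four_rhs = bregman(d, a) - bregman(d, b) - bregman(c, a) + bregman(c, b)

# Three-point identity (A.9)
three_lhs = dot(sub(grad_h(a), grad_h(b)), sub(b, c))
three_rhs = bregman(c, a) - bregman(c, b) - bregman(b, a)
```

Both pairs agree up to floating-point rounding, and `bregman(x, y)` reduces to $\frac{1}{2}\|x - y\|^2$ for this particular $h$.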

A.1 Bounding the change $f(x(t+1)) - f(x^*)$

We start the analysis by bounding the gap $f(x(t+1)) - f(x^*)$. The lemma below is just a combination
of several results of [1]. We present the details below in one place for easy reference. The impact of
our delay-sensitive step sizes shows up in subsequent lemmas, where we bound the individual terms
that arise from Lemma A.2.

Lemma A.2. At any time-point $t$, let the gradient error due to delays be

$$e_t := \nabla f(x(t)) - g(t - \tau_t). \tag{A.10}$$

Then, we have the following (deterministic) bound:

$$\begin{aligned}
f(x(t+1)) - f(x^*) &\le \frac{1}{2\alpha(t, \tau_t)} \left[\|x^* - x(t)\|^2 - \|x^* - x(t+1)\|^2\right] + \langle e_t, x(t+1) - x^* \rangle + \frac{L - 1/\alpha(t, \tau_t)}{2} \|x(t) - x(t+1)\|^2 \\
&\le \frac{1}{2\alpha(t, \tau_t)} \left[\|x^* - x(t)\|^2 - \|x^* - x(t+1)\|^2\right] + \langle \nabla f(x(t)) - \nabla f(x(t - \tau_t)),\; x(t+1) - x^* \rangle \\
&\quad + \langle \nabla f(x(t - \tau_t)) - g(t - \tau_t),\; x(t) - x^* \rangle + \frac{1}{2\eta(t, \tau_t)} \|\nabla f(x(t - \tau_t)) - g(t - \tau_t)\|^2.
\end{aligned} \tag{A.11}$$
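Proof. Using convexity of $f$ we have

$$f(x_t) - f(x^*) \le \langle \nabla f(x(t)), x(t+1) - x^* \rangle + \langle \nabla f(x(t)), x(t) - x(t+1) \rangle. \tag{A.12}$$

Now apply Lipschitz continuity of $\nabla f$ to the second term to obtain

$$\begin{aligned}
f(x_t) - f(x^*) &\le \langle \nabla f(x(t)), x(t+1) - x^* \rangle + f(x(t)) - f(x(t+1)) + \frac{L}{2}\|x(t) - x(t+1)\|^2, \\
f(x(t+1)) - f(x^*) &\le \langle \nabla f(x(t)), x(t+1) - x^* \rangle + \frac{L}{2}\|x(t) - x(t+1)\|^2.
\end{aligned} \tag{A.13}$$

Using the definition (A.10) of the gradient error $e_t$, we can rewrite (A.13) as

$$f(x(t+1)) - f(x^*) \le \underbrace{\langle g(t - \tau_t), x(t+1) - x^* \rangle}_{T_1} + \underbrace{\langle e_t, x(t+1) - x^* \rangle}_{T_2} + \frac{L}{2}\|x(t) - x(t+1)\|^2.$$

To complete the proof, we bound the terms $T_1$ and $T_2$ separately below.

Bounding $T_1$: Since $x(t+1)$ is a minimizer in (2.1), from the projection inequality (A.6) we have

$$\langle x(t) - \alpha(t, \tau_t) g(t - \tau_t) - x(t+1),\; x - x(t+1) \rangle \le 0, \quad \forall\, x \in \mathcal{X}.$$

Choose $x = x^*$; then rewrite the above inequality and use identity (A.9) with $h(x) = \frac{1}{2}\|x\|^2$ to get

$$\begin{aligned}
\alpha(t, \tau_t) \langle g(t - \tau_t), x(t+1) - x^* \rangle &\le \langle x(t) - x(t+1),\; x(t+1) - x^* \rangle \\
&= \frac{1}{2}\|x^* - x(t)\|^2 - \frac{1}{2}\|x^* - x(t+1)\|^2 - \frac{1}{2}\|x(t+1) - x(t)\|^2.
\end{aligned}$$

Plugging in this bound for $T_1$ and collecting the $\|x(t+1) - x(t)\|^2$ terms we obtain

$$\begin{aligned}
f(x(t+1)) - f(x^*) &\le \frac{1}{2\alpha(t, \tau_t)} \left[\|x^* - x(t)\|^2 - \|x^* - x(t+1)\|^2 - \|x(t+1) - x(t)\|^2\right] + \langle e_t, x(t+1) - x^* \rangle + \frac{L}{2}\|x(t) - x(t+1)\|^2 \\
&= \frac{1}{2\alpha(t, \tau_t)} \left[\|x^* - x(t)\|^2 - \|x^* - x(t+1)\|^2\right] + \langle e_t, x(t+1) - x^* \rangle + \frac{L - 1/\alpha(t, \tau_t)}{2}\|x(t) - x(t+1)\|^2.
\end{aligned} \tag{A.14}$$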
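
Bounding $T_2$: Adding and subtracting $\nabla f(x(t - \tau_t))$ we obtain

$$\begin{aligned}
\langle e_t, x(t+1) - x^* \rangle &= \langle \nabla f(x(t)) - g(t - \tau_t),\; x(t+1) - x^* \rangle \\
&= \langle \nabla f(x(t)) - \nabla f(x(t - \tau_t)),\; x(t+1) - x^* \rangle + \langle \nabla f(x(t - \tau_t)) - g(t - \tau_t),\; x(t+1) - x^* \rangle \\
&= \langle \nabla f(x(t)) - \nabla f(x(t - \tau_t)),\; x(t+1) - x^* \rangle + \langle \nabla f(x(t - \tau_t)) - g(t - \tau_t),\; x_t - x^* \rangle \\
&\quad + \langle \nabla f(x(t - \tau_t)) - g(t - \tau_t),\; x(t+1) - x(t) \rangle \\
&\le \langle \nabla f(x(t)) - \nabla f(x(t - \tau_t)),\; x(t+1) - x^* \rangle + \langle \nabla f(x(t - \tau_t)) - g(t - \tau_t),\; x(t) - x^* \rangle \\
&\quad + \frac{1}{2\eta(t, \tau_t)}\|\nabla f(x(t - \tau_t)) - g(t - \tau_t)\|^2 + \frac{\eta(t, \tau_t)}{2}\|x(t+1) - x(t)\|^2,
\end{aligned}$$

where the last inequality is an application of (A.5). Adding this inequality to (A.14) and using
$1/\alpha(t, \tau_t) = L + \eta(t, \tau_t)$, we obtain (A.11). □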


The next step is to take expectations over (A.11) and then further bound the resulting terms
separately. Note that $\nabla f(x(t - \tau_t)) - g(t - \tau_t)$ is independent of $x(t)$ given $g(1), \ldots, g(t - \tau_t - 1)$
(since $x(t)$ is a function of gradients up to time $t - \tau_t - 1$). Thus, the third term in (A.11) has zero
expectation. It remains to consider expectations over the following three quantities:

$$\Delta(t) := \frac{1}{2\alpha(t, \tau_t)} \left[\|x^* - x(t)\|^2 - \|x^* - x(t+1)\|^2\right]; \tag{A.15}$$

$$\Gamma(t) := \langle \nabla f(x(t)) - \nabla f(x(t - \tau_t)),\; x(t+1) - x^* \rangle; \tag{A.16}$$

$$\Sigma(t) := \frac{1}{2\eta(t, \tau_t)} \|\nabla f(x(t - \tau_t)) - g(t - \tau_t)\|^2. \tag{A.17}$$

Lemma A.3 bounds (A.15) under Assumption 2.5 (A), while Lemma A.4 provides a bound under
Assumption 2.5 (B). Similarly, Lemmas A.5 and A.6 bound (A.16), while Lemma A.7 bounds (A.17).
Combining these bounds we obtain the theorem.


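A.2 Bounding $\Delta$, $\Gamma$, and $\Sigma$

Lemma A.3. Let $\Delta(t)$ be given by (A.15), and let Assumption 2.5 (A) hold. Then,

$$\sum_{t=1}^{T} \mathbb{E}[\Delta(t)] = \frac{1}{2} \sum_{t=1}^{T} \mathbb{E}\left[\frac{1}{\alpha(t, \tau_t)} \left(\|x^* - x(t)\|^2 - \|x^* - x(t+1)\|^2\right)\right] \le \frac{1}{2}(L + c) R^2 + \sqrt{2}\, c R^2 \bar{\tau} \sqrt{T}.$$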


Proof. Unlike the delay-independent step sizes treated in [1], bounding $\Delta(t)$ requires some more work
because $\alpha(t, \tau_t)$ depends on $\tau_t$, which in turn breaks the monotonically decreasing nature of $\alpha(t, \tau_t)$
(we wish to avoid using a fixed worst-case bound on the steps, to gain more precise insight into the
impact of being sensitive to delays), necessitating a more intricate analysis.

Let $r_t = \|x(t) - x^*\|^2$. Observe that although $r_t \perp\!\!\!\perp \tau_t$, it is not independent of $\tau(t-1)$. Thus, with

$$z_t := \frac{1}{\alpha(t, \tau_t)} - \frac{1}{\alpha(t-1, \tau_{t-1})} = c\left(\sqrt{t + \tau_t} - \sqrt{t - 1 + \tau_{t-1}}\right),$$

we have


T 


^E[A(f)] 

t=i 


-E 




La(l,T(l)) 



(A.18) 


Since a(f, Zt) is not monotonically decreasing with t, while upper-bounding E[A(f)] we cannot simply 
discard the final term in (A.18). 
When $\tau(t-1) \sim U(\{0,\ldots,2\bar\tau\})$, $r_t$ uniformly takes on at most $2\bar\tau+1$ values

\[
r_{t,s} := \|x_{t,s}-x^*\|^2, \qquad s \in [2\bar\tau],
\]

where $x_{t,s} = \Pi_{\mathcal{X}}\bigl[x_{t-1} - \alpha(t-1,\tau(t-1)=s)\,g(t-1,\tau(t-1))\bigr]$. Given a delay $\tau(t-1) = s$, $r_t$ is just
$r_{t,s}$. Using $z_t = \alpha(t)^{-1} - \alpha(t-1)^{-1} = c\sqrt{t+\tau_t} - c\sqrt{t-1+\tau_{t-1}}$, we have

\[
z_{t,s} = c\bigl(\sqrt{t+\tau_t} - \sqrt{t-1+s}\bigr), \qquad s \in [2\bar\tau].
\]

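Using nested expectations $\mathbb{E}[z_t r_t] = \mathbb{E}_{\tau_t}\bigl[\mathbb{E}[z_t r_t \mid \tau_t]\bigr]$ we then see that

\[
\mathbb{E}[z_t r_t] = \frac{1}{2\bar\tau+1}\sum_{l=0}^{2\bar\tau}\Bigl(\frac{1}{2\bar\tau+1}\sum_{s=0}^{2\bar\tau} r_{t,s}\,c\bigl(\sqrt{t+l}-\sqrt{t-1+s}\bigr)\Bigr)
\le \frac{1}{2\bar\tau+1}\sum_{l=0}^{2\bar\tau}\Bigl(\frac{1}{2\bar\tau+1}\sum_{s=0}^{l-1} r_{t,s}\,c\bigl(\sqrt{t+l}-\sqrt{t-1+s}\bigr)\Bigr),
\]

where we dropped the terms with s > l as they are non-positive.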
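Consider now the inner summation above. We have

\[
\frac{1}{2\bar\tau+1}\sum_{s=0}^{l-1} r_{t,s}\,c\bigl(\sqrt{t+l}-\sqrt{t-1+s}\bigr)
\le \frac{cR^2}{2\bar\tau+1}\sum_{s=0}^{l-1}\bigl(\sqrt{t+l}-\sqrt{t-1+s}\bigr)
= \frac{cR^2}{2\bar\tau+1}\sum_{s=0}^{l-1}\frac{l-s+1}{\sqrt{t+l}+\sqrt{t-1+s}}
\]
\[
\le \frac{cR^2}{2\bar\tau+1}\,\frac{1}{\sqrt{2t-1}}\sum_{s=0}^{l-1}(l-s+1)
= \frac{cR^2}{2\bar\tau+1}\,\frac{1}{\sqrt{2t-1}}\,\frac{3l+l^2}{2}.
\]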


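Thus, we now consider

\[
\frac{1}{2\bar\tau+1}\sum_{l=0}^{2\bar\tau}\frac{cR^2}{2\bar\tau+1}\,\frac{3l+l^2}{2\sqrt{2t-1}}
= \frac{cR^2}{(2\bar\tau+1)^2\sqrt{2t-1}}\cdot\frac{(2\bar\tau+1)(4\bar\tau+10)\,\bar\tau}{6}
\le \frac{2cR^2\bar\tau}{\sqrt{2t-1}}.
\]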

Summing over $t = 2$ to $T$, we finally obtain the upper bound

\[
\sum_{t=2}^{T}\mathbb{E}[z_t r_t] \le 2cR^2\bar\tau\sum_{t=2}^{T}\frac{1}{\sqrt{2t-1}} \le 2cR^2\bar\tau\sqrt{2T}.
\]
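Two elementary facts carry the last steps of this proof: the count $\sum_{s=0}^{l-1}(l-s+1) = (3l+l^2)/2$, and the comparison $\sum_{t=2}^{T}(2t-1)^{-1/2} \le \sqrt{2T}$. A minimal sketch that checks both numerically follows; the helper names `inner_count` and `tail_sum` are illustrative, not from the paper.

```python
import math

def inner_count(l):
    """Sum of (l - s + 1) over s = 0, ..., l - 1."""
    return sum(l - s + 1 for s in range(l))

def tail_sum(T):
    """Sum of 1 / sqrt(2t - 1) over t = 2, ..., T."""
    return sum(1.0 / math.sqrt(2 * t - 1) for t in range(2, T + 1))

# Closed form of the inner count, checked in exact integer arithmetic.
closed_form_ok = all(2 * inner_count(l) == 3 * l + l * l for l in range(100))

# Integral comparison: the tail sum stays below sqrt(2T) for the sampled horizons.
sum_bound_ok = all(tail_sum(T) <= math.sqrt(2 * T) for T in (2, 10, 100, 1000))
```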


□

Lemma A.4. Let Assumption 2.5 (B) hold. Then

\[
\sum_{t=1}^{T}\mathbb{E}[\Delta(t)] \le \tfrac{1}{2}R^2(L+c) + \tfrac{1}{2}cR^2\sum_{t=2}^{T}\frac{\bar\tau_t+1}{\sqrt{2t-1}}.
\]
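Proof. Proceeding as for Lemma A.3, according to (A.18), the task reduces to bounding $\mathbb{E}[z_t r_t]$.
Consider thus,

\[
\mathbb{E}[z_t r_t] \le \mathbb{E}[z_t^+ r_t] \le R^2\,\mathbb{E}[z_t^+],
\]

where we use $z_t^+$ to denote $\max(z_t, 0)$. Let us now control the last expectation. Let $P_t(l) = P(\tau(t) = l)$,
then

\[
\mathbb{E}[z_t^+] = \sum_{l=0}^{t-1}\sum_{s=0}^{t-2} P_t(l)P_{t-1}(s)\max(0, z_t)
= c\sum_{l=0}^{t-1}\sum_{s=0}^{t-2} P_t(l)P_{t-1}(s)\max\bigl(0,\ \sqrt{t+l}-\sqrt{t-1+s}\bigr)
\]
\[
= c\sum_{l=0}^{t-1}\sum_{s=0}^{t-2} P_t(l)P_{t-1}(s)\,\frac{\max(0,\ l+1-s)}{\sqrt{t+l}+\sqrt{t-1+s}}
\le c\sum_{l=0}^{t-1}\sum_{s=0}^{t-2} P_t(l)P_{t-1}(s)\,\frac{l+1}{\sqrt{2t+l-1}}
\]
\[
\le c\sum_{l=0}^{t-1} P_t(l)\,\frac{l+1}{\sqrt{2t-1}} = c\,\frac{\bar\tau_t+1}{\sqrt{2t-1}}.
\]

So

\[
\sum_{t=2}^{T}\mathbb{E}[z_t r_t] \le cR^2\sum_{t=2}^{T}\frac{\bar\tau_t+1}{\sqrt{2t-1}}.
\]

□

Lemma A.5.

\[
\sum_{t=1}^{T}\mathbb{E}[\Gamma(t)] = \sum_{t=1}^{T}\mathbb{E}\bigl[\langle \nabla f(x(t)) - \nabla f(x(t-\tau_t)),\ x(t+1)-x^*\rangle\bigr]
\le \bar\tau GR + \frac{LC_1}{2} + \frac{LC_2}{2}\log T,
\]

where

\[
C_1 := \frac{G^2\bar\tau(\bar\tau+1)(2\bar\tau+1)}{3(L^2+c^2)} \quad\text{and}\quad C_2 := \frac{G^2(4\bar\tau+3)(\bar\tau+1)}{3c^2}.
\]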

Proof. This proof is an adaptation of Lemma 4 and Corollary 1 of Agarwal and Duchi [1]. First, we
exploit convexity of $f$ to help analyze the gradient differences using the four-point identity (A.8):

\[
\langle \nabla f(x(t)) - \nabla f(x(t-\tau_t)),\ x(t+1)-x^*\rangle
= D_f(x^*,x(t)) - D_f(x^*,x(t-\tau_t)) - D_f(x(t+1),x(t)) + D_f(x(t+1),x(t-\tau_t)). \tag{A.19}
\]

Since $\nabla f$ is $L$-Lipschitz, we further have

\[
f(x(t+1)) \le f(x(t-\tau_t)) + \langle \nabla f(x(t-\tau_t)),\ x(t+1)-x(t-\tau_t)\rangle + \tfrac{L}{2}\|x(t-\tau_t)-x(t+1)\|^2.
\]

By definition of a Bregman divergence, we also have

\[
D_f(x(t+1),x(t-\tau_t)) = f(x(t+1)) - f(x(t-\tau_t)) - \langle \nabla f(x(t-\tau_t)),\ x(t+1)-x(t-\tau_t)\rangle,
\]

which, upon using (A.7), immediately yields the bound

\[
D_f(x(t+1),x(t-\tau_t)) \le \tfrac{L}{2}\|x(t-\tau_t)-x(t+1)\|^2.
\]
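The four-point identity and the smoothness bound on the Bregman divergence can both be checked numerically. The sketch below is only an illustration: it uses a hypothetical test function $f(x) = \sum_i \bigl(\log(1+e^{x_i}) + x_i^2/2\bigr)$, which is convex with a $1.25$-Lipschitz gradient, and the helper names are assumptions introduced here.

```python
import math
import random

def f(x):
    """Illustrative convex test function: softplus plus a quadratic, coordinate-wise."""
    return sum(math.log1p(math.exp(v)) + 0.5 * v * v for v in x)

def grad(x):
    """Gradient of f: sigmoid(x_i) + x_i in each coordinate."""
    return [1.0 / (1.0 + math.exp(-v)) + v for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def bregman(x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - dot(grad(y), sub(x, y))

def four_point_gap(a, b, c, d):
    """|<grad f(a) - grad f(b), c - d> - (D(d,a) - D(d,b) - D(c,a) + D(c,b))|."""
    lhs = dot(sub(grad(a), grad(b)), sub(c, d))
    rhs = bregman(d, a) - bregman(d, b) - bregman(c, a) + bregman(c, b)
    return abs(lhs - rhs)

random.seed(1)
a, b, c, d = [[random.uniform(-2, 2) for _ in range(5)] for _ in range(4)]
gap = four_point_gap(a, b, c, d)   # roles: a = x(t), b = x(t - tau_t), c = x(t+1), d = x*

L_f = 1.25                         # Lipschitz constant of grad for this test function
diff = sub(c, b)
smooth_ok = 0.0 <= bregman(c, b) <= 0.5 * L_f * dot(diff, diff) + 1e-12
```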
Dropping the negative term $-D_f(x(t+1),x(t))$ from (A.19) and summing over $t$, we then obtain

\[
\sum_{t=1}^{T}\langle \nabla f(x(t)) - \nabla f(x(t-\tau_t)),\ x(t+1)-x^*\rangle
\le \sum_{t=1}^{T}\bigl[D_f(x^*,x(t)) - D_f(x^*,x(t-\tau_t))\bigr] + \frac{L}{2}\sum_{t=1}^{T}\|x(t+1)-x(t-\tau_t)\|^2.
\]

Notice that the first sum partially telescopes, leaving only the terms not received by the server within
the first $T$ iterations. Thus, we obtain the bound

\[
\sum_{t:\,t+\tau_t>T} D_f(x^*,x(t)) + \frac{L}{2}\sum_{t=1}^{T}\|x(t+1)-x(t-\tau_t)\|^2. \tag{A.20}
\]


We bound each of the terms in (A.20) in turn below.

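To bound the contribution of the first term in expectation, compute the expected cardinality

\[
\mathbb{E}\bigl[\,|\{t : t+\tau_t > T\}|\,\bigr] = \sum_{t=1}^{T}\Pr(\tau_t > T-t). \tag{A.21}
\]

Assuming delays uniform on $\{0,\ldots,2\bar\tau\}$, bounding this cardinality is easy, since

\[
\Pr(\tau_t > T-t) =
\begin{cases}
0, & T-t > 2\bar\tau,\\[2pt]
\dfrac{2\bar\tau - T + t}{2\bar\tau+1}, & \text{otherwise.}
\end{cases}
\]

Assuming that $2\bar\tau+1 < T$, (A.21) becomes (unsurprisingly)

\[
\sum_{s=0}^{2\bar\tau}\frac{2\bar\tau - s}{2\bar\tau+1} = \frac{(4\bar\tau-2\bar\tau)(2\bar\tau+1)}{2(2\bar\tau+1)} = \bar\tau.
\]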
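The expected number of gradients still in flight at time $T$ can be computed exactly for uniform delays. The sketch below does so with rational arithmetic, assuming integer $\bar\tau$ and $T > 2\bar\tau + 1$; the function name `expected_late` is illustrative.

```python
from fractions import Fraction

def expected_late(T, tau_bar):
    """sum over t = 1..T of Pr(tau_t > T - t), tau_t uniform on {0, ..., 2*tau_bar}."""
    n = 2 * tau_bar + 1                        # number of equally likely delay values
    total = Fraction(0)
    for t in range(1, T + 1):
        k = T - t
        late_values = max(0, 2 * tau_bar - k)  # delays in {k+1, ..., 2*tau_bar}
        total += Fraction(late_values, n)
    return total

value = expected_late(100, 7)
```

Under these assumptions the sum equals $\bar\tau$ exactly, which is the quantity multiplying $GR$ in the first term of the bound.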


From the definition of a Bregman divergence we immediately see that

\[
0 \le D_f(x^*,x(t)) \le -\langle \nabla f(x(t)),\ x^*-x(t)\rangle \le \|\nabla f(x(t))\|\,\|x^*-x(t)\| \le GR.
\]

Thus, the contribution of the first term in (A.20) is bounded in expectation by $\bar\tau GR$.
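To bound the contribution of the second term, use convexity of $\|\cdot\|^2$ to obtain

\[
\|x(t+1)-x(t-\tau_t)\|^2
= \|x(t+1)-x(t) + x(t)-x(t-1) + \cdots + x(t-\tau_t+1)-x(t-\tau_t)\|^2
\]
\[
\le (\tau_t+1)\sum_{s=0}^{\tau_t}\|x(t-s+1)-x(t-s)\|^2
= (\tau_t+1)\sum_{s=0}^{\tau_t}\bigl\|\Pi_{\mathcal{X}}\bigl(x(t-s) - \alpha(t-s,\tau_{t-s})\,g(t-s,\tau_{t-s})\bigr) - \Pi_{\mathcal{X}}(x(t-s))\bigr\|^2
\]
\[
\le (\tau_t+1)\,G^2\sum_{s=0}^{\tau_t}\alpha(t-s,\tau_{t-s})^2.
\]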
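The first inequality above is the standard consequence of convexity, $\|\sum_{i=1}^{n} a_i\|^2 \le n\sum_{i=1}^{n}\|a_i\|^2$, applied with $n = \tau_t + 1$ increments. A minimal sketch that checks it on random vectors is given below; the names `sq_norm` and `jensen_gap` are illustrative.

```python
import random

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)

def jensen_gap(increments):
    """n * sum_i ||a_i||^2 - ||sum_i a_i||^2, which convexity of ||.||^2 makes >= 0."""
    n = len(increments)
    total = [sum(col) for col in zip(*increments)]
    return n * sum(sq_norm(a) for a in increments) - sq_norm(total)

random.seed(2)
# Four increments play the role of x(t-s+1) - x(t-s) for s = 0..tau_t with tau_t = 3.
increments = [[random.gauss(0, 1) for _ in range(6)] for _ in range(4)]
gap = jensen_gap(increments)
```

The gap vanishes when all increments are identical, so the factor $\tau_t + 1$ cannot be improved in general.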


Conditioned on the delay $\tau_t$ we have

\[
\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2 \mid \tau_t\bigr] \le (\tau_t+1)\,G^2\sum_{s=0}^{\tau_t}\mathbb{E}\bigl[\alpha(t-s,\tau_{t-s})^2\bigr].
\]

Under the uniform or scaled assumptions on delays, we obtain similar bounds on the above quantity.
Consider now the expectation

\[
\mathbb{E}\bigl[\alpha(t-s,\tau(t-s))^2\bigr] = \mathbb{E}\Bigl[\frac{1}{L^2 + c^2\bigl((t-s)+\tau(t-s)\bigr) + 2Lc\sqrt{t-s+\tau(t-s)}}\Bigr] \le \frac{1}{L^2+c^2(t-s)};
\]

if $\tau_t = l$,

\[
\sum_{s=0}^{l}\mathbb{E}\bigl[\alpha(t-s,\tau_{t-s})^2\bigr] \le \sum_{s=0}^{l}\frac{1}{L^2+c^2(t-s)} \le \frac{l+1}{L^2+c^2(t-l)}.
\]

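Thus, for $t > 2\bar\tau$, we have the following bound

\[
\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2\bigr]
\le G^2\sum_{l=0}^{2\bar\tau}\frac{1}{2\bar\tau+1}\,\frac{(l+1)^2}{L^2+c^2(t-l)}
\le \frac{G^2\sum_{l=0}^{2\bar\tau}(l+1)^2}{(2\bar\tau+1)\bigl(L^2+c^2(t-2\bar\tau)\bigr)}
= \frac{G^2(4\bar\tau+3)(\bar\tau+1)}{3\bigl(L^2+c^2(t-2\bar\tau)\bigr)},
\]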


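and for $t \le 2\bar\tau$, we have

\[
\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2\bigr]
\le G^2\sum_{l=0}^{t-1}P_t(l)\,\frac{(l+1)^2}{L^2+c^2(t-l)}
\le G^2\sum_{l=0}^{t-1}\frac{(l+1)^2}{L^2+c^2}
= \frac{G^2\,t(t+1)(2t+1)}{6(L^2+c^2)}.
\]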


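Now adding up over $t = 1$ to $T$, we have

\[
\sum_{t=1}^{T}\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2\bigr] \le C_1 + C_2\log T.
\]

□

Lemma A.6. Assuming scaled delays, we have the bound

\[
\sum_{t=1}^{T}\mathbb{E}[\Gamma(t)] = \sum_{t=1}^{T}\mathbb{E}\bigl[\langle \nabla f(x(t)) - \nabla f(x(t-\tau_t)),\ x(t+1)-x^*\rangle\bigr]
\le GR\Bigl(1+\sum_{t=1}^{T-1}\frac{B_t^2}{(T-t)^2}\Bigr) + \frac{LG^2}{2}\sum_{t=1}^{T}\frac{B_t^2+1+\bar\tau_t}{L^2+c^2(1-\theta_t)t}.
\]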


Proof. We build on Corollary 1 of [1], and proceed as in Lemma A.5 to bound the terms in (A.20)
separately. For the first term, we bound the expected cardinality using Chebyshev's inequality and
Assumption 2.5 (B):

\[
\mathbb{E}\bigl[\,|\{t : t+\tau_t > T\}|\,\bigr] = \sum_{t=1}^{T}\Pr(\tau_t > T-t) \le 1+\sum_{t=1}^{T-1}\frac{\mathbb{E}[\tau_t^2]}{(T-t)^2} = 1+\sum_{t=1}^{T-1}\frac{B_t^2}{(T-t)^2}.
\]

To bound the second term, we again follow Lemma A.5 to obtain

\[
\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2 \mid \tau_t\bigr] \le (\tau_t+1)\,G^2\sum_{s=0}^{\tau_t}\mathbb{E}\bigl[\alpha(t-s,\tau_{t-s})^2\bigr].
\]
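As before,

\[
\mathbb{E}\bigl[\alpha(t-s,\tau(t-s))^2\bigr] = \mathbb{E}\Bigl[\frac{1}{L^2 + c^2\bigl((t-s)+\tau(t-s)\bigr) + 2Lc\sqrt{t-s+\tau(t-s)}}\Bigr] \le \frac{1}{L^2+c^2(t-s)},
\]

which yields the bound (since $0 \le s \le \tau_t$)

\[
\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2 \mid \tau_t\bigr] \le \frac{G^2(\tau_t+1)^2}{L^2+c^2(t-\tau_t)}.
\]

Now adding up over $t = 1$ to $T$ consider

\[
G^2\sum_{t=1}^{T}\frac{(\tau_t+1)^2}{L^2+c^2(t-\tau_t)},
\]

so that taking expectation (over $\tau_t$) we then obtain

\[
\sum_{t=1}^{T}\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2\bigr] \le G^2\sum_{t=1}^{T}\mathbb{E}\Bigl[\frac{(\tau_t+1)^2}{L^2+c^2(t-\tau_t)}\Bigr].
\]

Using our assumption that $\tau_t < \theta_t t$ for $\theta_t \in (0,1)$, we have in particular that

\[
G^2\sum_{t=1}^{T}\mathbb{E}\Bigl[\frac{(\tau_t+1)^2}{L^2+c^2(t-\tau_t)}\Bigr]
\le G^2\sum_{t=1}^{T}\frac{\mathbb{E}[(\tau_t+1)^2]}{L^2+c^2(1-\theta_t)t}
\le G^2\sum_{t=1}^{T}\frac{B_t^2+1+\bar\tau_t}{L^2+c^2(1-\theta_t)t}.
\]

□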


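Lemma A.7. Let the step-offsets be $\eta(t,\tau_t) = c\sqrt{t+\tau_t}$. For any delay distribution we have

\[
\sum_{t=1}^{T}\mathbb{E}[\Sigma(t)] \le \frac{\sigma^2}{c}\sqrt{T}.
\]

Proof. From Assumption 2.2 on the variance of stochastic gradients, it follows that

\[
\mathbb{E}[\Sigma(t)] = \mathbb{E}\Bigl[\frac{1}{2\eta(t,\tau_t)}\,\|\nabla f(x(t-\tau_t)) - g(t-\tau_t)\|^2\Bigr] \le \frac{\sigma^2}{2}\,\mathbb{E}\bigl[\eta(t,\tau_t)^{-1}\bigr].
\]

Plugging in $\eta(t,\tau_t) = c\sqrt{t+\tau_t}$, clearly the bound

\[
\frac{1}{c}\,\mathbb{E}\bigl[(t+\tau_t)^{-1/2}\bigr] = \frac{1}{c}\sum_{s\ge 0}P_t(s)\,\frac{1}{\sqrt{t+s}} \le \frac{1}{c\sqrt{t}}
\]

holds for any delay distribution. Summing up over $t$, we then obtain

\[
\sum_{t=1}^{T}\mathbb{E}[\Sigma(t)] \le \frac{\sigma^2}{2c}\sum_{t=1}^{T}\frac{1}{\sqrt{t}} \le \frac{\sigma^2}{c}\sqrt{T}. \tag{A.22}
\]

□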
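Both steps of this argument are easy to test numerically: the expected inverse offset is at most $1/(c\sqrt{t})$ for any delay distribution, and $\sum_{t\le T} t^{-1/2} \le 2\sqrt{T}$. The sketch below uses an arbitrary random delay distribution; the function names and the sample values of `sigma2` and `c` are illustrative assumptions.

```python
import math
import random

def expected_inv_offset(t, probs, c=1.0):
    """E[1 / eta(t, tau)] with eta = c * sqrt(t + tau) and Pr(tau = s) = probs[s]."""
    return sum(p / (c * math.sqrt(t + s)) for s, p in enumerate(probs))

def sigma_sum(T, sigma2, c):
    """(sigma^2 / (2c)) * sum over t = 1..T of t^(-1/2)."""
    return sigma2 / (2.0 * c) * sum(1.0 / math.sqrt(t) for t in range(1, T + 1))

random.seed(3)
weights = [random.random() for _ in range(20)]
probs = [w / sum(weights) for w in weights]    # arbitrary delay distribution on {0..19}

c, sigma2, T = 0.7, 2.0, 500
per_step_ok = all(
    expected_inv_offset(t, probs, c) <= 1.0 / (c * math.sqrt(t)) + 1e-12
    for t in range(1, 200)
)
total_ok = sigma_sum(T, sigma2, c) <= sigma2 / c * math.sqrt(T)
```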


























B More general step-sizes 


If we use the offsets $\eta_t = c(t+\tau_t)^{\beta}$, where $0 < \beta < 1$, we obtain slightly more general step sizes that
fit within our framework. The only benefit of considering step sizes other than $\beta = 1/2$ is that they
allow us to trade off the contributions of the various terms in the bounds. For a larger value of $\beta$,
for instance, we obtain smaller step sizes, which can be beneficial in high-noise regimes, at least in
the initial iterations. The theoretical sweet spot (in terms of dependence on $T$) is, however, $\beta = 1/2$,
the choice analyzed above. We summarize below the impact of these step sizes for non-uniform
scaled delays; the uniform case is even simpler. For simplicity, we do not bound the terms as tightly as
for the special case $\beta = 1/2$.
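To make the effect of $\beta$ concrete, here is a minimal sketch of the delay-sensitive step size. It assumes $\alpha(t,\tau_t) = (L+\eta_t)^{-1}$, which is what the expansion of $\alpha^2$ in Appendix A implies; the function name and the sample constants are illustrative only.

```python
def step_size(t, tau, L, c, beta=0.5):
    """alpha(t, tau) = 1 / (L + c * (t + tau)**beta); beta = 0.5 is the analyzed case."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return 1.0 / (L + c * (t + tau) ** beta)

# Same iteration and delay, three exponents: a larger beta gives a smaller step.
steps = [step_size(t=10, tau=2, L=1.0, c=1.0, beta=b) for b in (0.25, 0.5, 0.75)]
```

For fixed $t + \tau_t > 1$ the step shrinks as $\beta$ grows, matching the remark about high-noise regimes.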

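Lemma B.1. Assume that $\tau_t$ satisfies Assumption 2.5 (B) and $\eta_t = c(t+\tau_t)^{\beta}$ with $0 < \beta < 1$. Then,

\[
\mathbb{E}[z_t^+] \le \frac{c\beta(\bar\tau_t+1)}{(t-1)^{1-\beta}}, \tag{B.1}
\]
\[
\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2 \mid \tau_t\bigr] \le \frac{G^2(\tau_t+1)^2}{L^2+c^2(t-\tau_t)^{2\beta}}, \tag{B.2}
\]
\[
\mathbb{E}\bigl[\eta(t,\tau_t)^{-1}\bigr] \le \frac{1}{c\,t^{\beta}}. \tag{B.3}
\]

Proof. Proceeding as in Lemma A.4 we bound

\[
\mathbb{E}[z_t^+] = c\sum_{l=0}^{t-1}\sum_{s=0}^{t-2}P_t(l)P_{t-1}(s)\max\bigl(0,\ (t+l)^{\beta}-(t-1+s)^{\beta}\bigr)
\le c\sum_{l=0}^{t-1}\sum_{s=0}^{t-2}P_t(l)P_{t-1}(s)\,\beta\,\frac{\max(0,\ l+1-s)}{(t-1+s)^{1-\beta}}
\]
\[
\le c\beta\sum_{l=0}^{t-1}\sum_{s=0}^{t-2}P_t(l)P_{t-1}(s)\,\frac{l+1}{(t-1)^{1-\beta}}
\le c\beta\sum_{l=0}^{t-1}P_t(l)\,\frac{l+1}{(t-1)^{1-\beta}}
= \frac{c\beta(\bar\tau_t+1)}{(t-1)^{1-\beta}},
\]

where the first inequality follows from concavity of $t^{\beta}$, the second one since the summand is decreasing in
$s$, while the third is clear as $P_{t-1}$ is a probability.

Next, we bound (B.2). Proceeding as in Lemma A.6, we obtain the bounds

\[
\mathbb{E}\bigl[\|x(t+1)-x(t-\tau_t)\|^2 \mid \tau_t\bigr] \le \frac{G^2(\tau_t+1)^2}{L^2+c^2(t-\tau_t)^{2\beta}}.
\]

Finally, the bound on (B.3) is trivial; since $\eta_t^{-1} = c^{-1}(t+\tau_t)^{-\beta}$ we have

\[
\frac{1}{c}\,\mathbb{E}\bigl[(t+\tau_t)^{-\beta}\bigr] = \frac{1}{c}\sum_{s=0}^{t-1}P_t(s)\,\frac{1}{(t+s)^{\beta}} \le \frac{1}{c\,t^{\beta}}.
\]

□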


Using these key bounds, we can define full versions of Lemmas A.4, A.6, and A.7, where we
will finally need a bound of the form

\[
\sum_{t=1}^{T}\frac{1}{t^{\beta}} \le 1+\int_{1}^{T}\frac{dt}{t^{\beta}} \le \frac{T^{1-\beta}}{1-\beta}.
\]
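The integral comparison can be sanity-checked for a few exponents and horizons. A minimal sketch follows; the helper names `power_sum` and `integral_bound` are illustrative.

```python
def power_sum(T, beta):
    """Sum over t = 1..T of t^(-beta)."""
    return sum(t ** (-beta) for t in range(1, T + 1))

def integral_bound(T, beta):
    """T^(1-beta) / (1-beta), an upper bound on power_sum for 0 < beta < 1."""
    return T ** (1.0 - beta) / (1.0 - beta)

checks = [
    power_sum(T, beta) <= integral_bound(T, beta)
    for beta in (0.1, 0.5, 0.9)
    for T in (1, 10, 1000)
]
```

With $\beta = 1/2$ the bound reads $2\sqrt{T}$, recovering the summation used in Lemma A.7.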